The data used in this example project is the titanic dataset which is a very common and widely used dataset and therefore also quite uninteresting as subject to an extensive data analysis or machine learning. But the version of the dataset that I have required suits the purpose of illustrating my data cleaning capabilities very well because it consists of different parts that have to be put together and contains problematic values of different kinds which have to be handled.

Imports and display setup:

In [19]:

import pandas as pd

Connecting the different parts of the dataset¶

Connecting part 1 and part 2:

In [20]:

part_1_df = pd.read_excel("./data/datasæt del1.xlsx")
part_2_df = pd.read_csv("./data/datasæt del2 komma.csv")

titanic_df = pd.concat([part_1_df, part_2_df])
titanic_df.reset_index(drop=True, inplace=True)
titanic_df.tail()

Out[20]:

	pclass	survived	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	homedest	case_id
1308	3	0.0	female	NaN	0	0	65305	8.1125	NaN	S	NaN	NaN	NaN	1308
1309	3	0.0	male	0.75	1	1	SOTON/O.Q. 3101315	13.775	NaN	S	NaN	NaN	NaN	1309
1310	3	1.0	male	25.00	0	0	345768	9.5	NaN	S	11	NaN	NaN	1310
1311	3	1.0	male	14.00	0	0	7538	9.225	NaN	S	13	NaN	NaN	1311
1312	3	0.0	male	25.00	1	0	347076	7.775	NaN	S	NaN	NaN	NaN	1312

Adding names to the dataframe:

In [21]:

names_txt = pd.read_csv('./data/navne.txt', sep=",", header=None)
names_column = names_txt.loc[1:,1].reset_index(drop=True)

titanic_df['name'] = names_column
titanic_df.head()

Out[21]:

	pclass	sex	age	sibsp	ticket	fare	cabin	embarked	boat	body	homedest	case_id	name
0	1	male	67.0	1	PC 17483	221.779	C55 C57	S	NaN	96.0	New York, NY	1	Straus, Mr. Isidor
1	1	male	46.0	0	694	26	NaN	S	NaN	80.0	Bennington, VT	2	Jones, Mr. Charles Cresson
2	1	male	50.0	1	PC 17761	106.425	C86	C	NaN	62.0	Deephaven, MN / Cedar Rapids, IA	3	Douglas, Mr. Walter Donald
3	1	male	61.0	0	36963	32.3208	D50	S	NaN	46.0	Haddenfield, NJ	4	Sutton, Mr. Frederick
4	1	male	32.5	0	113503	211.5	C132	C	NaN	45.0	NaN	5	Keeping, Mr. Edwin

Setting case_id as the index of the dataframe:

In [22]:

titanic_df.set_index('case_id', inplace=True)

Data cleaning¶

Removing duplicates in the dataset:

In [23]:

print('Locating duplicates:')
display(titanic_df[titanic_df.duplicated()])
print('\n Duplicates before:')
display(titanic_df.loc[550:551])
titanic_df = titanic_df.drop_duplicates()
# Showing the duplicate
print('\n Duplicates after:')
display(titanic_df.loc[550:551])

Locating duplicates:

	pclass	survived	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	homedest	name
case_id
550	2	0.0	male	NaN	0	0	239853	0	NaN	S	NaN	NaN	Belfast	Cunningham, Mr. Alfred Fleming

 Duplicates before:

	pclass	survived	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	homedest	name
case_id
550	2	0.0	male	NaN	0	0	239853	0	NaN	S	NaN	NaN	Belfast	Cunningham, Mr. Alfred Fleming
550	2	0.0	male	NaN	0	0	239853	0	NaN	S	NaN	NaN	Belfast	Cunningham, Mr. Alfred Fleming
551	2	0.0	male	35.0	0	0	233734	12.35	NaN	Q	NaN	NaN	NaN	Keane, Mr. Daniel

 Duplicates after:

	pclass	survived	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	homedest	name
case_id
550	2	0.0	male	NaN	0	0	239853	0	NaN	S	NaN	NaN	Belfast	Cunningham, Mr. Alfred Fleming
551	2	0.0	male	35.0	0	0	233734	12.35	NaN	Q	NaN	NaN	NaN	Keane, Mr. Daniel

Removing row which contains too high fare value:

In [24]:

# Change the type in the fare column to float 
titanic_df['fare'].replace('.', 0.0, inplace=True)
numeric_fare = pd.to_numeric(titanic_df['fare'])
titanic_df.loc[:, 'fare'] = numeric_fare

# Finding the row with the maximum fare value and removing it from the dataframe
print('Max fare value before:')
display(titanic_df.loc[titanic_df['fare'] == titanic_df['fare'].max()])
titanic_df.drop(60, axis=0, inplace=True)
print('\n Max fare value after:')
display(titanic_df.loc[titanic_df['fare'] == titanic_df['fare'].max()])

Max fare value before:

	pclass	survived	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	homedest	name
case_id
60	1	1.0	female	17.0	0	1	12345	25000.0	NaN	S	NaN	NaN	Philadelphia, PA	Bukater, Miss. Rose Dewitt

 Max fare value after:

	pclass	survived	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	homedest	name
case_id
150	1	1.0	male	36.0	0	1	PC 17755	512.3292	B51 B53 B55	C	3	NaN	Austria-Hungary / Germantown, Philadelphia, PA	Cardeza, Mr. Thomas Drake Martinez
199	1	1.0	male	35.0	0	0	PC 17755	512.3292	B101	C	3	NaN	NaN	Lesurer, Mr. Gustave J
267	1	1.0	female	35.0	0	0	PC 17755	512.3292	NaN	C	3	NaN	NaN	Ward, Miss. Anna
296	1	1.0	female	58.0	0	1	PC 17755	512.3292	B51 B53 B55	C	3	NaN	Germantown, Philadelphia, PA	Cardeza, Mrs. James Warburton Martinez (Charlo...

Removing datasource where the death is hard to determine:

In [25]:

# Locate values which are different than 0 and 1 in the survived column
print('Values different than 0 or 1 in survived column before:')
display(titanic_df.loc[~titanic_df['survived'].isin([1,0])])
titanic_df.drop(325, axis=0, inplace=True)
print('\n Values different than 0 or 1 in survived column after:')
display(titanic_df.loc[~titanic_df['survived'].isin([1,0])])

Values different than 0 or 1 in survived column before:

	pclass	survived	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	homedest	name
case_id
325	1	NaN	male	68.0	0	0	36010	10.0	NaN	S	NaN	NaN	New York City, NY	Wonder, Mr. Stevie

 Values different than 0 or 1 in survived column after:

	pclass	survived	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	homedest	name
case_id

Removing data scource where age is below 0:

In [26]:

print('Value in age column below 0 before:')
display(titanic_df.loc[titanic_df['age'] < 0])
titanic_df.drop(172, axis=0, inplace=True)
print('\n Value in age column below 0 after:')
display(titanic_df.loc[titanic_df['age'] < 0])

Value in age column below 0 before:

	pclass	survived	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	homedest	name
case_id
172	1	1.0	male	-62.0	0	0	24161	10.0	Shutter Island	S	NaN	NaN	Los Angeles, CA	DiCaprio, Mr. Leonardo

 Value in age column below 0 after:

	pclass	survived	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	homedest	name
case_id

Further cleaning of data¶

Overview of the cleaned dataframe as it is now:

In [27]:

print('Summary info about the dataframe:\n')
display(titanic_df.info())
print('\n Data types:')
display(titanic_df.dtypes)

Summary info about the dataframe:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1312
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   float64
 2   sex       1309 non-null   object 
 3   age       1046 non-null   float64
 4   sibsp     1309 non-null   int64  
 5   parch     1309 non-null   int64  
 6   ticket    1309 non-null   object 
 7   fare      1309 non-null   float64
 8   cabin     295 non-null    object 
 9   embarked  1307 non-null   object 
 10  boat      486 non-null    object 
 11  body      121 non-null    float64
 12  homedest  745 non-null    object 
 13  name      1309 non-null   object 
dtypes: float64(4), int64(3), object(7)
memory usage: 153.4+ KB

None

 Data types:

pclass        int64
survived    float64
sex          object
age         float64
sibsp         int64
parch         int64
ticket       object
fare        float64
cabin        object
embarked     object
boat         object
body        float64
homedest     object
name         object
dtype: object

Turning sibsp (sibling/spouse) and parch (parent/child) columns into one column called family and removing the individual columns:

In [28]:

titanic_df['family_size'] = titanic_df['sibsp'] + titanic_df['parch'] + 1
titanic_df = titanic_df.drop('sibsp', axis=1)
titanic_df = titanic_df.drop('parch', axis=1)
display(titanic_df.head())

	pclass	survived	sex	age	ticket	fare	cabin	embarked	boat	body	homedest	name	family_size
case_id
1	1	0.0	male	67.0	PC 17483	221.7792	C55 C57	S	NaN	96.0	New York, NY	Straus, Mr. Isidor	2
2	1	0.0	male	46.0	694	26.0000	NaN	S	NaN	80.0	Bennington, VT	Jones, Mr. Charles Cresson	1
3	1	0.0	male	50.0	PC 17761	106.4250	C86	C	NaN	62.0	Deephaven, MN / Cedar Rapids, IA	Douglas, Mr. Walter Donald	2
4	1	0.0	male	61.0	36963	32.3208	D50	S	NaN	46.0	Haddenfield, NJ	Sutton, Mr. Frederick	1
5	1	0.0	male	32.5	113503	211.5000	C132	C	NaN	45.0	NaN	Keeping, Mr. Edwin	1

Locating NaN values in the dataset:

In [29]:

titanic_df.isnull().sum()

Out[29]:

pclass            0
survived          0
sex               0
age             263
ticket            0
fare              0
cabin          1014
embarked          2
boat            823
body           1188
homedest        564
name              0
family_size       0
dtype: int64

Removing the cabin, boat, and body columns because it contains way too many NaN values:

In [30]:

titanic_df.drop(columns='cabin', inplace=True)
titanic_df.drop(columns='boat', inplace=True)
titanic_df.drop(columns='body', inplace=True)

Replacing NaN values in age col with median values:

In [31]:

titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())

Replacing NaN values in embarked col with embarked value with highest frequency:

In [32]:

display(titanic_df['embarked'].mode())
titanic_df['embarked'] = titanic_df['embarked'].fillna('S')

0    S
dtype: object

Creating adult col to indicate if the person is an adult or a child:

In [33]:

titanic_df['adult'] = 0
titanic_df['adult'][titanic_df['age'] >= 18] = 1

/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Renaming pclass column to something that is more interpretable:

In [34]:

titanic_df.rename(columns={'pclass': 'price_class'}, inplace=True)

The cleaned dataset now looks like this:

In [35]:

titanic_df.head()

Out[35]:

	price_class	survived	sex	age	ticket	fare	embarked	homedest	name	family_size	adult
case_id
1	1	0.0	male	67.0	PC 17483	221.7792	S	New York, NY	Straus, Mr. Isidor	2	1
2	1	0.0	male	46.0	694	26.0000	S	Bennington, VT	Jones, Mr. Charles Cresson	1	1
3	1	0.0	male	50.0	PC 17761	106.4250	C	Deephaven, MN / Cedar Rapids, IA	Douglas, Mr. Walter Donald	2	1
4	1	0.0	male	61.0	36963	32.3208	S	Haddenfield, NJ	Sutton, Mr. Frederick	1	1
5	1	0.0	male	32.5	113503	211.5000	C	NaN	Keeping, Mr. Edwin	1	1